Application of Information Retrieval Techniques for Source Code Authorship Attribution
Abstract
Authorship attribution assigns works of contentious authorship to their rightful owners, solving cases of theft, plagiarism and authorship disputes in academia and industry. In this paper we investigate the application of information retrieval techniques to authorship attribution of C source code. In particular, we explore novel methods for converting C code into documents suitable for retrieval systems, experimenting with 1,597 student programming assignments. We investigate several possible program derivations, partition attribution results by original program length to measure effectiveness on modest and lengthy programs separately, and evaluate three different methods for interpreting document rankings as authorship attribution. The best of our methods achieves an average classification accuracy of 76.78% on a one-in-ten classification problem, which is competitive with six existing baselines. The techniques that we present can be the basis of practical software to support source code authorship investigations.
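As a rough illustration of the general idea in the abstract (reduce each program to a retrieval "document", rank indexed documents against a query program, then read an author off the ranking), a minimal Python sketch follows. Everything specific in it is an assumption made for illustration and is not taken from the paper: the small token set in `TOKEN_RE`, the n-gram length of 3, TF-IDF cosine ranking, and the rule of taking the author of the top-ranked document. The function names `to_ngrams`, `rank` and `attribute` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a few C keywords and operators.
# This is an illustrative assumption, not the feature list used in the paper.
TOKEN_RE = re.compile(
    r"\b(?:if|else|for|while|return|int|char)\b|==|!=|\+\+|--|[-+*/=<>{}();]"
)


def to_ngrams(source, n=3):
    """Reduce C source to feature tokens and join them into n-gram 'words'."""
    tokens = TOKEN_RE.findall(source)
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank(query_source, corpus, n=3):
    """Rank (author, source) pairs by cosine similarity of TF-IDF n-gram vectors."""
    docs = [Counter(to_ngrams(src, n)) for _, src in corpus]
    query = Counter(to_ngrams(query_source, n))

    # Document frequency: in how many indexed programs each n-gram occurs.
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log(1 + len(docs) / df[t]) for t in df}

    def weigh(counts):
        # Terms never seen in the index carry no weight.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
            sum(w * w for w in b.values())
        )
        return dot / norm if norm else 0.0

    q = weigh(query)
    scored = [(cosine(q, weigh(doc)), author) for doc, (author, _) in zip(docs, corpus)]
    return sorted(scored, reverse=True)


def attribute(query_source, corpus):
    """Assumed rule: attribute the query to the author of the top-ranked document."""
    return rank(query_source, corpus)[0][1]


corpus = [
    ("alice", "for (i = 0; i < n; i++) { s = s + i; }"),
    ("bob", "while (n > 0) { n--; if (n == 3) return n; }"),
]
query = "for (j = 0; j < m; j++) { t = t + j; }"
predicted = attribute(query, corpus)
```

Because identifiers and literals are discarded, the query above yields the same token stream as the first indexed program despite different variable names, which is the kind of stylistic signal such a representation is meant to capture. Other ways of reading the ranking (for example, aggregating scores per author) would slot into `attribute` without changing the indexing step.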
Similar resources
Comparing techniques for authorship attribution of source code
Attributing authorship of documents with unknown creators has been studied extensively for natural language text such as essays and literature, but less so for non-natural languages such as computer source code. Previous attempts at attributing authorship of source code can be categorised by two attributes: the software features used for the classification, either strings of n tokens/bytes (n-g...
Poster: Source Code Authorship Attribution
As information becomes widely available and easily accessible through the Internet and other sources, the trend of plagiarism has been increasing. Plagiarism and copyright infringement are issues that come up in both academic and corporate environments. We need author classification techniques to inhibit such unethical violations. Source code is also intellectual property and reflects individua...
A survey of modern authorship attribution methods
Authorship attribution supported by statistical or computational methods has a long history starting from the 19th century and marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed Federalist Papers. During the last decade, this scientific field has been developed substantially taking advantage of research advances in areas such as machine learning, informati...
Studying Users’ Emotions Attribution Style in Information Retrieval Based on Weiner’s Emotion Attribution Theory
Background and Aim: This research aimed to study the emotion attribution style of users in information retrieval based on Weiner's theory. Methods: The survey method was used in this study. The population consisted of graduate students in humanities at Imam Reza (AS) International University. A sample of 72 students was selected. Data were collected using an attribution style questionnaire (ASQ) and two resea...
Supporting the Cybercrime Investigation Process: Effective Discrimination of Source Code Authors Based on Byte-level Information
Source code authorship analysis is the particular field that attempts to identify the author of a computer program by treating each program as a linguistically analyzable entity. This is usually based on other undisputed program samples from the same author. There are several cases where the application of such a method could be of a major benefit, such as tracing the source of code left in the...